Visualization (Exploring Co-variation)

Author

Peter Ganong and Maggie Shi

Published

October 14, 2024

Table of contents

  1. Categorical variable and a continuous variable
  2. Two categorical variables
  3. Two continuous variables
  4. Graphics for production

Categorical variable and a continuous variable

import altair as alt
from palmerpenguins import load_penguins

# Load the Palmer penguins data as a pandas DataFrame
penguins = load_penguins()
display(penguins)
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
  0     Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
  1     Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
  2     Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
  3     Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
  4     Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
...        ...        ...             ...            ...                ...          ...     ...   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009
343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009

344 rows × 8 columns

numeric & categorical: box plot

numeric & categorical: mark_boxplot()

alt.Chart(penguins).mark_boxplot().encode(
    x=alt.X('species:N', title="Species"), 
    y=alt.Y('body_mass_g:Q', title="Body Mass (g)"),
).properties(
    width=400,
    height=300
)

Discussion question: what do you notice from this graph?
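
To read the box plot, it helps to know what each part of the box encodes. The sketch below computes the pieces of a Tukey-style box plot in plain Python: quartiles, whiskers that reach the most extreme points within 1.5 × IQR of the box, and outliers beyond them. The sample values are hypothetical, and the "inclusive" quartile method is an assumption that may differ slightly from what the plotting library computes.

```python
import statistics

def box_stats(values, extent=1.5):
    """Summary statistics behind a Tukey-style box plot."""
    data = sorted(values)
    # Quartiles by linear interpolation (the "inclusive" method is an assumption).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - extent * iqr
    high_fence = q3 + extent * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical body masses in grams, not taken from the penguins data.
sample = [3250, 3450, 3750, 3800, 3900, 4100, 4300, 4675, 6300]
summary = box_stats(sample)
print(summary)
```

In this made-up sample the single very heavy value falls outside the upper fence, so it would be drawn as a separate point rather than stretching the whisker.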

numeric & categorical: transform_density()

alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['species'],  # one density curve per species
    as_=['body_mass_g', 'density']
).mark_line().encode(
    alt.X('body_mass_g:Q'),
    alt.Y('density:Q', stack=None),  # overlay the curves instead of stacking them
    alt.Color('species:N')
).properties(width=400, height=300)
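
A density curve like the ones above is a smoothed histogram. As a rough sketch of the idea, the code below builds a Gaussian kernel density estimate by hand: every observation contributes a small bell curve, and the density at x is their average. The data and the 200 g bandwidth are hypothetical; the plotting library chooses its own bandwidth, so its curves will not match this sketch exactly.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of `values` at a point x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centered on each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# Hypothetical body masses (g) and a hypothetical bandwidth of 200 g.
masses = [3400, 3550, 3700, 3750, 3900, 4200]
density = gaussian_kde(masses, bandwidth=200)

# A density should integrate to about 1; check with a simple Riemann sum.
step = 10
area = sum(density(x) * step for x in range(2000, 6001, step))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one gives a bumpier curve, which is worth keeping in mind when comparing species.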

numeric & categorical: transform_density()

Discussion question: What if we required the x-axis range to include zero? Would that improve or reduce clarity? Why?

alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['species'],
    as_=['body_mass_g', 'density']
).mark_line().encode(
    alt.X('body_mass_g:Q', scale=alt.Scale(zero=True)),  # force the x-axis to start at zero
    alt.Y('density:Q', stack=None),
    alt.Color('species:N')
).properties(width=400, height=300)

numeric & categorical: transform_density() filled in

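Filling the curves with mark_area(opacity=0.3) does not change the information shown; it is arguably a bit more elegant, and the transparency keeps overlapping distributions visible.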

alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['species'],  # Group by species for different density curves
    as_=['body_mass_g', 'density']
).mark_area(opacity=0.3).encode(
    alt.X('body_mass_g:Q'),
    alt.Y('density:Q', stack=None),
    alt.Color('species:N')
).properties(width=400, height=300)

Two categorical variables

Two continuous variables

Two continuous variables: roadmap

  • movie ratings from Rotten Tomatoes and IMDB
  • diamonds: carat vs price

movies dataset

import pandas as pd

movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
movies = pd.read_json(movies_url)

Covariation: a first binned scatter plot

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
)

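This plot suffers from overplotting! Every movie in a bin is drawn as a circle at the same position, so we cannot tell how many movies each circle represents.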

Use alt.Size('count()') to address overplotting

xy_size = alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Size('count()')
)
xy_size
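
To make the count() encoding concrete, here is a minimal sketch of the aggregation behind a 2D histogram: assign each point to a rectangular bin, then count the points per bin. The rating pairs and the bin widths (5 and 0.5) are hypothetical; with maxbins=20 the library picks its own bin edges.

```python
import math
from collections import Counter

def bin_counts(points, x_step, y_step):
    """Count points per 2D bin, keyed by each bin's lower-left corner."""
    counts = Counter()
    for x, y in points:
        x_bin = math.floor(x / x_step) * x_step
        y_bin = math.floor(y / y_step) * y_step
        counts[(x_bin, y_bin)] += 1
    return counts

# Hypothetical (Rotten Tomatoes, IMDB) rating pairs, not real movies.
ratings = [(87, 7.4), (89, 7.1), (86, 7.3), (12, 3.9), (14, 3.6), (55, 6.2)]
counts = bin_counts(ratings, x_step=5, y_step=0.5)

for corner, n in counts.most_common():
    print(corner, n)
```

Each (bin, count) pair becomes one mark in the chart, with the count mapped to the size or the color of that mark.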

Use alt.Color('count()') to address overplotting

xy_color = alt.Chart(movies_url).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Color('count()')
) 
xy_color

Discussion question

xy_size | xy_color

Compare the size and color-based 2D histograms above. Which encoding do you think should be preferred? Why?

Summary: Exploring covariation

Scenario                              Functions
Categorical and continuous variable   mark_boxplot(), transform_density()
Two categorical variables             size, color
Two continuous variables              alt.Size('count()'), alt.Color('count()'), mark_boxplot(), binscatter

Do-pair-share

We are now going to transition from making plots that teach ourselves about the data to making plots for an audience.

Are penguins getting heavier (body_mass_g) over time?

Bonus: what is the headline of your plot and what are the sub-messages?

Do-pair-share solution I

alt.Chart(penguins).mark_bar().encode(
  alt.Y('average(body_mass_g):Q',  scale=alt.Scale(zero=False)),
  alt.X('year:N'),
  alt.Color('year:N')
)

This does answer the question, albeit in the simplest, most boring way possible.
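
Each bar in solution I is just a group mean. A minimal sketch of that calculation in plain Python follows, assuming missing body masses are skipped before averaging (the penguins data contains a few NaN rows). The rows are hypothetical, and mean_by_group is an illustrative helper, not a library function.

```python
import math
from collections import defaultdict

def mean_by_group(rows, group_key, value_key):
    """Average `value_key` within each `group_key`, skipping missing values."""
    groups = defaultdict(list)
    for row in rows:
        value = row[value_key]
        if value is None or math.isnan(value):
            continue  # assumption: missing values are dropped, not treated as zero
        groups[row[group_key]].append(value)
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}

# Hypothetical rows, not taken from the penguins data.
rows = [
    {"year": 2007, "body_mass_g": 3750.0},
    {"year": 2007, "body_mass_g": float("nan")},
    {"year": 2007, "body_mass_g": 3250.0},
    {"year": 2008, "body_mass_g": 4300.0},
    {"year": 2008, "body_mass_g": 4100.0},
    {"year": 2009, "body_mass_g": 4000.0},
]
means = mean_by_group(rows, "year", "body_mass_g")
print(means)
```

A single number per year hides the shape of each distribution, which is why the density version in the next solution can say more.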

Do-pair-share solution II

alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['year'],  # one density curve per year
    as_=['body_mass_g', 'density']
).mark_line().encode(
    x='body_mass_g:Q',
    y='density:Q',
    color='year:N'
)
  • Headline: 2007 is lightest, 2008 is heaviest

  • Sub-messages

    1. Similar shares of penguins above 5,000 grams in 2008 and 2009
    2. Average weight is higher in 2008 because 2009 has more lightweight penguins
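
The two sub-messages can be checked numerically rather than read off the curves. The sketch below computes, per year, the share of penguins above a threshold and the mean weight. The records are hypothetical and chosen only to show how two years can have the same share of heavy penguins while one has a lower average because of more lightweight ones.

```python
from collections import defaultdict

def share_above(records, threshold):
    """Fraction of values above `threshold` within each group."""
    totals = defaultdict(int)
    above = defaultdict(int)
    for group, value in records:
        totals[group] += 1
        if value > threshold:
            above[group] += 1
    return {group: above[group] / totals[group] for group in totals}

def group_means(records):
    """Mean value within each group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        sums[group] += value
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}

# Hypothetical (year, body mass in grams) records, not the real penguins data.
records = [
    (2008, 5200), (2008, 4800), (2008, 3900), (2008, 5400),
    (2009, 5100), (2009, 3300), (2009, 5600), (2009, 3400),
]
shares = share_above(records, threshold=5000)
means = group_means(records)
print(shares)
print(means)
```

In this made-up example both years have half of their penguins above 5,000 grams, yet the second year's mean is lower because its remaining penguins are lighter.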

Meta comment: iterating on plot design

“Make dozens of plots” – Quoctrung Bui, former 30535 guest lecturer and former Harris data viz instructor

What does he mean?

  • The first plot you make will never be the one you should show
  • As a rule of thumb, you should try out at least three different plotting concepts (marks)
  • Within each concept, you will need to try out several different encodings